With recent innovations in dense image captioning, it is now possible to describe every object in a scene with a caption, where objects are localized by bounding boxes. However, interpreting such output is not trivial due to the presence of many overlapping bounding boxes. Furthermore, current captioning frameworks do not allow the user to incorporate personal preferences and exclude regions that are not of interest. In this paper, we propose a novel hybrid deep learning architecture for interactive region segmentation and captioning, in which the user can specify an arbitrary region of the image to be processed. To this end, a dedicated Fully Convolutional Network (FCN), named Lyncean FCN (LFCN), is trained on our specially prepared training data to isolate the User Intention Region (UIR) as the output of an efficient segmentation. In parallel, a dense image captioning model provides a wide variety of captions for that region, and the UIR is then described by the caption of the best-matching bounding box. To the best of our knowledge, this is the first work to provide such a comprehensive output. Our experiments show the superiority of the proposed approach over state-of-the-art interactive segmentation methods on several well-known datasets. In addition, replacing the bounding boxes with the result of the interactive segmentation leads to a better understanding of the dense image captioning output, as well as improved object detection accuracy in terms of Intersection over Union (IoU).